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Abstract 

We consider solving multi-objective optimization problems in a distributed manner by a network 
of cooperating and learning agents. The problem is equivalent to optimizing a global cost that is the 
sum of individual components. The optimizers of the individual components do not necessarily coincide 
and the network therefore needs to seek Pareto optimal solutions. We develop a distributed solution that 
relies on a general class of adaptive diffusion strategies. We show how the diffusion process can be 
represented as the cascade composition of three operators: two combination operators and a gradient 
descent operator. Using the Banach fixed-point theorem, we establish the existence of a unique fixed 
point for the composite cascade. We then study how close each agent converges towards this fixed point, 
and also examine how close the Pareto solution is to the fixed point. We perform a detailed mean-square 
error analysis and establish that all agents are able to converge to the same Pareto optimal solution 
within a sufficiently small mean-square-error (MSE) bound even for constant step-sizes. We illustrate 
one application of the theory to collaborative decision making in finance by a network of agents. 

Index Terms 

Distributed optimization, network optimization, diffusion adaptation, Pareto optimality, mean-square 
performance, convergence, stability, fixed point, collaborative decision making. 

I. Introduction 

We consider solving a multi-objective optimization problem in a distributed manner over a netw^ork of 
cooperative learners (see Fig. [T]). Each agent k is associated w^ith an individual cost function J^{w); 
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and each of these costs may not be minimized at the same vector w*^. As such, we need to seek a solution 
that is "optimal" in some sense for the entire network. In these cases, a general concept of optimality 
known as Pareto optimality is useful to characterize how good a solution is. A solution is said to be 
Pareto optimal if there does not exist another vector w that is able to improve (i.e., reduce) any particular 
cost, say, J^{w), without degrading (increasing) some of the other costs {J^(w)}i^k' To illustrate the 
idea of Pareto optimality, let 

O ^ {{J^w), . . . , JUw)) : ^ G W} C (1) 

denote the set of achievable cost values, where W denotes the feasible set. Each point P G O represents 
attained values for the cost functions {Jj^iw)} at a certain w ^ W. Let us consider the two-node case 
(N = 2) shown in Fig. [2} where the shaded areas represent the set O for two situations of interest. In Fig. 




J9M 



Fig. 1. A network of N cooperating agents; a cost function Jk{u)) is associated with each node k. The set of neighbors of 
node k is denoted by jVk (including node k itself); this set consists of all nodes with which node k can share information. 



2(a) both Ji{w) and Jfl^) achieve their minima at the same point P = (Jf (it;^), Jfl^^))' where is 



the common minimizer. In comparison, in Fig. 2(b) Ji{w) attains its minimum at point Pi, while J2 (^) 



attains its minimum at point P2, so that they do not have a common minimizer. Instead, all the points 
on the heavy red curve between points Pi and P2 are Pareto optimal solutions. For example, starting at 
point A on the curve, if we want to reduce the value of Ji{w) without increasing the value of J2 (^)' 
then we will need to move out of the achievable set O. The alternative choice that would keep us on the 
curve is to move to another Pareto optimal point P, which would however increase the value of (^)- 
In other words, we need to trade the value of (^) Jfiw). For this reason, the curve from Pi to 
P2 is called the optimal tradeoff curve (or optimal tradeoff surface if > 2) |2, p. 183]. 
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Fig. 2. Optimal and Pareto optimal points for the case N = 2: (Left) P denotes the optimal point where both cost functions 
are minimized simultaneously and (Right) Pareto optimal points lie on the red boundary curve. 

To solve for Pareto optimal solutions, a useful scalarization technique is often used to form an aggregate 
cost function that is the weighted sum of the component costs as follows: 



N 



(2) 



1=1 



where tt/ is a positive weight attached with the /th cost. It was shown in f2^, pp. 178-1 80] that the minimizer 
of ([2]) is Pareto optimal for the multi-objective optimization problem. Moreover, by varying the values 
of {tt^}, we are able to get different Pareto optimal points on the tradeoff curve. Observing that we can 
always define a new cost Ji{w) by incorporating the weighting scalar tt/, 



Jl{w)^TllJf{w) 

it is sufficient for our future discussions to focus on aggregate costs of the following form: 



(3) 




(4) 

If desired, we can also add constraints to problem Q. For example, suppose there is additionally some 
constraint of the form p^w < at node k, where pj^ is M x 1 and bk is a scalar. Then, we can consider 
using barrier functions to convert the constrained optimization problem to an unconstrained problem 
[[sj. For example, we can redefine each cost Jk{w) to be Jki'w) ^ Jk{^) + 4^{Pk^ ~ ^k)^ where (j){x) 
is a barrier function that penalizes values of w that violate the constraint. Therefore, without loss of 
generality, we shall assume W = and only consider unconstrained optimization problems. Moreover, 
we shall assume the {Ji{w)} are differentiable and, for each given set of positive weights {tt/}, the cost 
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jgiobj^^^ in ([2]) or (|4]) is strongly convex so that the minimizer is unique [41. Note that the new cost 
Ji{w) in ([3]) depends on tti so that the that minimizes J^^^^{w) in (|4]) also depends on {tti}. 

One of the most studied approaches to the distributed solution of such optimization problems is the 
incremental approach — see, e.g., |j5|-|[T2|. In this approach, a cyclic path is defined over the nodes 



and data are processed in a cyclic manner through the network until optimization is achieved. However, 
determining a cyclic path that covers all nodes is generally an NP-hard problem [ jT3| and, in addition, 
cyclic trajectories are vulnerable to link and node failures. Another useful distributed optimization 
approach relies on the use of consensus strategies [[T4|-[[20|. In this approach, vanishing step-size 
sequences are used to ensure that agents reach consensus and agree about the optimizer in steady-state. 
However, in time-varying environments, diminishing step-sizes prevent the network from continuous 
learning; when the step-sizes die out, the network stops learning. 

In pH , we generalized our earlier work on adaptation and learning over networks [ |22| , [ [23| and 



developed diffusion strategies that enable the decentralized optimization of global cost functions of the 
form (|4]). In the diffusion approach, information is processed locally at the nodes and then diffused 
through a real-time sharing mechanism. In this manner, the approach is scalable, robust to node and 
link failures, and avoids the need for cyclic trajectories. In addition, compared to the aforementioned 



consensus solutions (such as those in [ [T6| , [ |T9| , p4|), the diffusion strategies we consider here employ 
constant (rather than vanishing) step-sizes in order to endow the resulting networks with continuous 
learning and tracking abilities. By keeping the step-sizes constant, the agents are able to track drifts in 
the underlying costs and in the location of the Pareto optimal solutions. One of the main challenges in 
the ensuing analysis becomes that of showing that the agents are still able to approach the Pareto optimal 
solution even with constant step-sizes; in this way, the resulting diffusion strategies are able to combine 
the two useful properties of optimality and adaptation. 



In 1 21 1, we focused on the important case where all costs {Ji{w)} share the same optimal solution 



(as was the case with Fig. |2(a)| ); this situation arises when the agents in the network have a common 
objective and they cooperate to solve the problem of mutual interest in a distributed manner. Examples 
abound in biological networks where agents work together, for example, to locate food sources or 



evade predators |25|, and in collaborative spectrum sensing p6| , system identification p7| , and learning 
applications flEj . In this paper, we develop the necessary theory to show that the same diffusion approach 
(described by ([9])-([T0]) below) can be used to solve the more challenging multi-objective optimization 
problem, where the agents need to converge instead to a Pareto optimal solution. Such situations are 



common in the context of multi-agent decision making (see, e.g., |3 1 and also Sec. IV where we discuss 



August 14, 2012 DRAFT 



5 



one application in the context of collaborative decision in finance). To study this more demanding scenario, 
we first show that the proposed diffusion process can be represented as the cascade composition of three 
operators: two combination (aggregation) operators and one gradient-descent operator. Using the Banach 
fixed-point theorem ^291, pp.299-303], we establish the existence of a unique fixed point for the composite 
cascade. We then study how close each agent in the network converges towards this fixed point, and also 
examine how close the Pareto solution is to the fixed point. We perform a detailed mean-square error 
analysis and establish that all agents are able to converge to the same Pareto optimal solution within a 
sufficiently small mean-square-error (MSE) bound. We illustrate the results by considering an example 
involving collaborative decision in financial applications. 

Notation. Throughout the paper, all vectors are column vectors. We use boldface letters to denote random 
quantities (such as Uk^i) and regular font to denote their realizations or deterministic variables (such as 
Uk,i)' We use diagjxi, . . . , xat} to denote a (block) diagonal matrix consisting of diagonal entries (blocks) 
xi, . . . , xat, and use col{xi, . . . , xat} to denote a column vector formed by stacking xi, . . . , on top of 
each other. The notation x ^ y means each entry of the vector x is less than or equal to the corresponding 
entry of the vector y. 

II. Diffusion Adaptation Strategies 

In pT| , we motivated and derived diffusion strategies for distributed optimization, which are captured 
by the following general description: 

N 

= ^ o.i.ik'^i.i-i (5) 

^=1 N 

i^k,i = (l>k,i-l - l^k"^ CikVwJl{<^k,i-l) (6) 

N 

"^k.i = ^ Ci2,lki^l,i (7) 

1=1 

where Wk^i is the local estimate for w"^ at node k and time i, jik is the step-size parameter used by node k, 

and {(l)k,i-iii^k,i} are intermediate estimates for w^. Moreover, VwJi{') is the (column) gradient vector 

of Ji{') relative to w. The non-negative coefficients {ai^ik}, {cik}, and {a2^ik} are the (/, A:)-th entries of 

matrices Ai, C, and A2, respectively, and they are required to satisfy: 

/■ 

Ajl = 1, A^l = 1, CI = 1 

(8) 

^ cii,ik = 0, a2^ik = 0, cik^O a I ^ Mk 
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where 1 denotes a vector with all entries equal to one. Note from ([8]) that the combination coefficients 
{(^i,ik^ (^2,ik^ cik} are nonzero only for those / G A4- Therefore, the sums in ([5])-(|7]) are confined within the 
neighborhood of node k. Condition ([8]) requires the combination matrices {Ai, A2} to be left- stochastic, 
while C is right- stochastic. We therefore note that each node k first aggregates the existing estimates 
from its neighbors through ([5]) and generates the intermediate estimate (f)k,i-i' Then, node k aggregates 
gradient information from its neighborhood and updates (f)k,i-i to (pk.i through (|6]). All other nodes in the 
network are performing these same steps simultaneously. Finally, node k aggregates the estimates {0/,^} 
through step ([7]) to update its weight estimate to Wk^i- 

Algorithm (|5])-(|7]) can be simplified to several special cases for different choices of the matrices 
{Ai^A2^C}. For example, the choice Ai = I, A2 = A and C = I reduces to the adapt-then-combine 
(ATC) strategy that has no exchange of gradient information pT|-[[23|, (30): 



i^k.i ^/e,i-l - l^k'^wJk{'^k,i-l) 
UJk.i = ^ CLlki^l.i 



(ATC, C = /) 



(9) 



while the choice Ai = A, A2 = I and C = I reduces to the combine-then-adapt (CTA) strategy, where 
the order of the combination and adaptation steps are reversed relative to ([9]) [ [22| , [ |23| , [ [30| : 



"Wk^i = - l^k^wJk{^k,i-l) 



(CTA, C = /) 



(10) 



Furthermore, if in the CTA implementation ( [T0| ) we enforce A to be doubly stochastic, replace V^Jki') by 
a subgradient, and use a time-decaying step-size parameter (fik(i) 0), then we obtain the unconstrained 
version used by p4| . In the sequel, we continue with the general recursions ([5])-(|7]), which allow us 
to examine the convergence properties of several algorithms in a unified manner. The challenge we 
encounter now is to show that this same class of algorithms can still optimize the cost (|4]) in a distributed 
manner when the individual costs {Ji{w)} do not necessarily have the same minimizer. This is actually a 
demanding task, as the analysis in the coming sections reveals, and we need to introduce novel analysis 
techniques to be able to handle this general case. 
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III. Performance Analysis 

A. Modeling Assumptions 

In most situations in practice, the true gradient vectors needed in (|6]) are not available. Instead, perturbed 
versions are available, which we model as 

VwJi{w) VwJi(w) + vi^i{w) (11) 

where the random noise term, vi^i{w), may depend on w and will be required to satisfy certain conditions 
given by ([T6|)-([T7l). We refer to the perturbation in ( [TT] ) as gradient noise. Using ( [TT] ), the diffusion 



algorithm ([5])-(|7]) becomes the following, where we are using boldface letters for various quantities to 
highlight the fact that they are now stochastic in nature due to the randomness in the noise component: 

(12) 





y^^cii,ikm,i-i 




^=1 N 




<pk,i-l - l^k ^ 




N 




= ^ Ci2,lk'^l,i 
1=1 



cik \^wJi{4^k,i-i) + vi^i{<pk, 



i-i) 



(13) 
(14) 

Using ([T2|)-([T4l), we now proceed to examine the mean- square performance of the diffusion strategies. 
Specifically, in the sequel, we study: (i) how fast and (ii) how close the estimator w^^i at each node k 
approaches the Pareto-optimal solution in the mean-square-error sense. We establish the convergence 
of all nodes towards the same Pareto-optimal solution within a small MSE bound. Since we are dealing 
with individual costs that may not have a common minimizer, the approach we employ to examine the 
convergence properties of the diffusion strategy is fundamentally different from pT| ; we follow a system- 



theoretic approach and call upon the fixed-point theorem for contractive mappings |[29} pp. 299-303]. 

To proceed with the analysis, we introduce the following assumptions on the cost functions and gradient 
noise. As explained in |21|, these conditions are weaker than similar conditions in the literature of 
distributed optimization; in this way, our convergence and performance results hold under more relaxed 
conditions than usually considered in the literature. 

Assumption 1 (Bounded Hessian). Each component cost function Ji(w) has a bounded Hessian matrix, 
i.e., there exist nonnegative real numbers A/ min A/ max ^^ch that, for each k — 1, . . . ,N: 

A/,min^M < Vl^Jliw) < Xl^ma^lM (15) 

with J2i=i cikXi ,min 

> 0. ■ 
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Assumption 2 (Gradient noise). There exist a > and cr^ > such that, for all w G Ji-i: 

E{v^{w)\Ti-i}^0 (16) 
E{\\vi4w)f} <a-E\\V^Ji{w)f + a^^ (17) 

for all z, /, where Ti-\ denotes the past history of estimators {wkj} for j < i — I and all k. ■ 

If we choose C = I, then Assumption [ijimpHes that the cost functions {Ji{w)} are strongly conve}|^ 
This condition can be guaranteed by adding small regularization terms. For example, we can convert a non- 
strongly convex function J/ (it;) to a strongly convex one by redefining Ji(w) as Ji{w) ^ Ji{w) + e\\w\\'^, 
where e > is a small regularization factor. We further note that, assumption ( fTT) ) is a mix of the 
"relative random noise" and "absolute random noise" model usually assumed in stochastic approximation 
||4). Condition (Tf) implies that the gradient noise grows when the estimate is away from the optimum 



(large gradient). Condition ( [17] ) also states that even when the gradient vector is zero, there is still some 
residual noise variance a'?,. 



B. Diffusion Adaptation Operators 

To analyze the performance of the diffusion adaptation strategies, we first represent the mappings 
performed by ([T2])-([T4]) in terms of useful operators. 

Definition 1 (Combination Operator). Suppose x — col{xi, . . . , xat} is an arbitrary N xl block column 
vector that is formed by stacking M x 1 vectors xi, . . . ,XAr on top of each other The combination 
operator Ta is defined as the linear mapping: 

Ta{x) ^ {A^ ® Im) X (18) 

where A is an N x N left stochastic matrix, and denotes the Kronecker product operation. ■ 

Definition 2 (Gradient-Descent Operator). Consider the same N x 1 block column vector x. Then, the 

differentiable function f{x) on is said to be strongly convex if there exists a Amin > such that f{x + y) > 
/(x)+2/^V/(x)+Amin||2/|P/2 for any x,y ^ R^. And if f{x) is twice-differentiable, this is also equivalent to V^/(x) > Amin^^ 
|4| pp.9- 10]. Strong convexity implies that the function f{x) can be lower bounded by some quadratic function. 
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gradient-descent operator Tq is the nonlinear mapping defined by: 

Tg{x) ^ : 

Xn - IJ^N Y^lLi Cin'^wJi^Xn) 



(19) 



Definition 3 (Power Operator). Consider the same N x 1 block vector x. The power operator P : 
^MN _^ (iefified as the mapping: 



where 



P[x] ^col{||xif,...,||xjvf} 
denotes the Euclidean norm of a vector 



(20) 



We will use the power operator to study how error variances propagate after a specific operator Ta{-) 
or Tg{-) is applied to a random vector. We remark that we are using the notation "P[ ]" rather than "P( )" 
to highlight the fact that P is a mapping from to a lower dimensional space R^. In addition to 

the above three operators, we define the following aggregate vector of gradient noise that depends on the 
state x: 



{N N \ 

CliVi{xi), . . . , IIN^ CinVi{xn) > 
1=1 1=1 J 



(21) 



With these definitions, we can now represent the two combination steps ([12]) and ([14]) as two combination 
operators Ta^ (•) and Ta^ (•). We can also represent the adaptation step ( [T3] ) by a gradient-descent operator 
perturbed by the noise operator ( [2T] ): 



Tg{x)^Tg{x)^v{x) 



(22) 



We can view Tg{x) as a random operator that maps each input x G R^^ into an R^^ random vector, 
and we use boldface letter to highlight this random nature. Let 



Wi = CO\{wi^i, W2^i, . . . , WN,i} 



(23) 



denote the vector that collects the estimators across all nodes. Then, the overall diffusion adaptation steps 
([T2l)-([T4l) that update Wi-i to Wi can be represented as a cascade composition of three operators: 



Td{-)^TA,oTGoTAA- 



(24) 
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(a) TaA')^Ta,{-) andTG(- 



(b) Cascade representation of diffusion adaptation. 



Fig. 3. Representation of the diffusion adaptation strategy (p^-|T4]) in terms of operators. Each diffusion adaptation step can 
be viewed as a cascade composition of three operators: Ta^{-), Tg(-), and Ta2(-) with gradient perturbation v{-). If v{-) = 0, 
then fd{-) becomes Td(-). 

where we use o to denote the composition of any two operators, i.e., Ti o T2{x) = ri(r2(x)). If there 



is no gradient noise, then the diffusion adaptation operator ([24]) reduces to 

T^(-) = Ta,oTgoTa,(-) 



(25) 



In other words, the diffusion adaptation over the entire network with and without gradient noise can be 
described in the following compact forms: 



Wi Td{wi-i) 
Wi Td{wi-i) 



(26) 
(27) 



Fig. 3(a) illustrates the role of the combination operator T^( ) (combination steps) and the gradient- 
descent operator Tg{-) (adaptation step). The combination operator Ta{-) aggregates the estimates from 
the neighborhood (social learning), while the gradient-descent operator Tg{-) incorporates information 



from the local gradient vector (self-learning). In Fig. |3(b)[ we show that each diffusion adaptation step 
can be represented as the cascade composition of three operators, with perturbation from the gradient 
noise operator. Next, in Lemma [TJ we examine some of the properties of the operators {Ta-^^Ta^^Tq}, 
which are proved in Appendix [A] 

Lemma 1 (Useful Properties). Consider N xl block vectors x — col{xi, . . . , xa^} and y = col{?/i, . . . , i/n} 
with M X 1 entries {xk^yk}- Then, the operators Ta(-), Tg(-) and P[-] satisfy the following properties: 

1) (Linearity): Ta{-) is a linear operator 

2) (Nonnegativity): P[x] >z 0. 
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3) (Scaling): For any scalar a G R, we have 

P[ax] = a^P[x] (28) 

4) (Convexity): suppose x^^\ . . . , x^^^ are N xl block vectors formed in the same manner as x, and 
let ai^ . . . ^ ax be non-negative real scalars that add up to one. Then, 

P[aix(^) + • • • + axx^^^] ^ aiP[x^^^] + • • • + axPix^^^] (29) 

5) (Additivity): Suppose x — coljcci, . . . ,xn} and y — col{yi, . . . , jjn} are N x 1 block random 
vectors that satisfy Kx^jjk — for k — 1^ . . . ^N. Then, 



6) ( Variance relations) : 



where 



EP[x + 2/] = EP[x] + EP[2/] (30) 

P[Ta{x)] ^ A^P[x] (31) 
P[Tg{x) - TG{y)] ^ V^P[x - y] (32) 

r = diag{7i, . . . ,7iv} (33) 

= max{|l - /XA:Cr/c,max|, \^ - l-^kCTkM^W (^4) 

N N 

,max 

(35) 

1=1 1=1 

7) ( Block Maximum Norm ): The oo— norm of P[x] is the squared block maximum norm of x: 

II^NIloo = Ikll6,oo - ( llXfcll) (36) 

\ l<k<N / 

8) (Preservation of Inequality): Suppose vectors x, y and matrix F have nonnegative entries, then 
X ^ y implies Fx ^ Fy. ■ 

C. Transient Analysis 

Using the operator representation developed above, we now analyze the transient behavior of the 
diffusion algorithm ([T2|)-([T4|). From Fig. |3(b)| and the previous discussion, we know that the stochastic 



A 



recursion Wi = Td{wi-i) is a perturbed version of the noise-free recursion Wi = Td{wi-i). Therefore, we 
first study the convergence of the noise free recursion, and then analyze the effect of gradient perturbation 
on the stochastic recursion. 
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Intuitively, if recursion Wi = Td{wi-i) converges, then it should converge to a vector Woo that satisfies 

^oo Td{woc) (37) 

In other words, the vector Woo should be sl fixed point of the operator Td{-) |[29j p. 299]. We need to 
answer four questions pertaining to the fixed point. First, does the fixed point exist? Second, is it unique? 
Third, under which condition does the recursion Wi = Td{wi-i) converge to the fixed point? Fourth, how 
far is the fixed point Woo away from the minimizer of (|4])? We answer the first two questions using 
the Banach Fixed Point Theorem (Contraction Theorem) [29, pp. 2-9, pp.299-300]. Afterwards, we study 
convergence under gradient perturbation. The last question will be considered in the next subsection. 

Definition 4 (Metric Space). A set X, whose elements we shall call points, is said to be a metric space 
if we can associate a real number d{p, q) with any two points p and q of X, such that 

(a) d{p, > ifp ^ q; d{p,p) = 0; 

(b) d{p,q) = d{q,p); 

(c) d{p^ q) < d{p, r) + d{r^ q), for any r G X. 

Any function d{p^ q) with these three properties is called a distance function, or a metric, and we denote 
a metric space X with distance d{-^ •) as (X, d). ■ 

Definition 5 (Contraction). Let (X, d) be a metric space. A mapping T : X — > X is called a contraction 
on X if there is a positive real number 5 <1 such that d(T(x)^T(y)) < S • d(x^ y) for all x, y G X 

Lemma 2 (Banach Fixed Point Theorem 1291). Consider a metric space (X, d), where X 7^ 0. Suppose 
that X is complet^ and let T : X ^ X be a contraction. Then, T has precisely one fixed point. ■ 

As long as we can prove that the diffusion operator T^( ) is a contraction, i.e., for any two points 
X, y G R^^, after we apply the operator T(i{-), the distance between T(i{x) and T(i{y) scales down by a 
scalar that is uniformly bounded away from one, then the fixed point w^q defined in ([37]) exists and is 
unique. We now proceed to show that r^( ) is a contraction operator in X = when the step-size 

parameters {{ik} satisfy certain conditions. 

Theorem 1 (Fixed Point). Suppose the step-size parameters {i^k} satisfy the following conditions 

0<fik<^—, A: = 1,2,...,X (38) 

metric space (X, d) is complete if any of its Cauchy sequences converges to a point in the space; a sequence {xn} is 
Cauchy in {X, d) if Ve > 0, there exists N such that d{xn,Xm) < e for all n, m > N. 



August 14, 2012 



DRAFT 



13 



Then, there exists a unique fixed point for the unperturbed diffusion operator T(i{-) in 

Proof: Let x = col{xi, . . . , xat} G R^^^i be formed by stacking M x 1 vectors xi, . . . , on 
top of each other. Similarly, let y = col{^i, . . . , y^}. The distance function y) that we will use is 
induced from the block maximum norm ([36]): d{x^ y) = \\x — y\\b,oo = niaxi</e<Ar \\xk — yk\\- From the 
definition of the diffusion operator T^( ) in ( [25] ), we have 

P[Td{x) - Td{y)] P [Ta, (Tg o Ta, (x) - Tq o Ta, (y))' 

(b) 



TGoTA,{x)-TGoTA,iy) 



^^Alr^P[TAdx)-TAM] 
^^A^r^P[TAAx-y)] 

? A^T^AjP[x - y] (39) 

where steps (a) and (d) are because of the linearity of Iai(-) and Ta2{-), steps (b) and (e) are because 
of the variance relation property ( |3T] ), and step (c) is due to the variance relation property ( |32l ). Taking 
the oo— norm of both sides of (|39l), we have 



\\P[Td{x) - Td{y)]\\oo < WA^r^AjWoo ■ \\P[x - y]\\oo 

< ||r||^.||P[x-y]|U (40) 

where, in the second inequality, we used the fact that ||oo = ||A^||oo = 1 since Aj and are 
right- stochastic matrices. Using property ( [36] ), we can conclude from ( [40] ) that: \\Td{x) — Td{y)\\b,oo ^ 
||r||oo • — y\\b,oo' Therefore, the operator Td{-) is a contraction if ||r||oo < 1, which, by substituting 



([33])-p4]), becomes 

|1 - M/cC^/c,max| < 1, |1 - fJ^kCrk,mm\ < 1, = 1, . . . , TV 



and we arrive at the condition ( [38] ) on the step-sizes In other words, if condition ( [38] ) holds for each 
A: = 1, . . . , A^, then T^( ) is a contraction operator. By Lemma [2] the operator T^( ) will have a unique 



fixed point Woo that satisfies equation (37). 



Given the existence and uniqueness of the fixed point, the third question to answer is if recursion 
Wi = Td{wi-i) converges to this fixed point. The answer is affimative under ( [38] ). However, we are not 
going to study this question separately. Instead, we will analyze the convergence of the more demanding 
noisy recursion ( [26] ). Therefore, we now study how fast and how close the successive estimators {wi} 
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generated by recursion ( |26l ) approach Woo- Once this issue is addressed, we will then examine how close 
Woo is to the desired w°. Introduce the following mean-square-perturbation (MSP) vector at time i: 



MSPi = EP[wi - Wo 



(41) 



The k-th entry of MSPj characterizes how far away the estimate Wk^i at node k and time i is from Wk^oo 
in the mean-square sense. To study the closeness of Wi to Woo, we shall study how the quantity MSPj 



evolves over time. By (|26]), pT) and the definitions of Td{-) and Td{-) in ( |24l ) and ( |25] ), we obtain 

MSPi = EP[wi - Woo] 



(a) 



Ta, o Tg o Ta, (Wi-i) - Ta, o Tg o Ta, {Woc 

EP [Ta, (fa o Ta, (wi-i) - Tg o Ta, {woo)) 

Tg o Ta, (wi-i) - Tg o Ta, (woo) 
TG(TA,{wi-i)'^ -TolTA.iwoo)) +v(TA,{wi-i] 
A^{Ep[TG(TA,{wi.i)'^ -Tg(Ta,{woo))] +Ep[v(TAdwi-i))]} 



^ A^EP 
A'^EP 

W AT. 



? A^r^EP[TA,{wi.i) - Ta,{woo)] + A^EP[v{TA,{wi.i))] 
if) 

^ A^r^Aj ■ EP[wi-i - Woo] + A'^EPiviTA.iwi-i))] 
= A^r^Aj • MSPi_i + A^EP[v{TA,{wi-i))] 



(42) 



where step (a) is by the linearity of Ta^{-), steps (b) and (f) are by property ( |3T] ), step (c) is by the 
substitution of ( |22l ), step (d) is by Property 5 in Lemma [T] and assumption ( fT6l l, and step (e) is by (|32]l. 
To proceed with the analysis, we establish the following lemma to bound the second term in (|42l). 



Lemma 3 (Bound on Gradient Perturbation). It holds that 

EP[v{TAAwi-i))] ^ 4aXl^JC\\ln^Aj-EP[wi-i-Woo] + \\C\\jnX 

where 

'^max — max max 

l<k<N 

by = 4aXl,^^Aj P[woo -In® w""] 

„o\\\2 



+ max {2a||V^JfcK)r + ^v} 

l<k<N 

Q = diag{/ii, . . . ,/iiv} 



(43) 

(44) 

(45) 
(46) 
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Proof: By the definition of v{x) in ( [21] ) with x = TA^{wi-i) being a random vector, we get 



EP[i;(x)] = 



(47) 



For each block in ( |47l ), using Jensen's inequahty, we have 



AT 



AT 



Clk 



1=1 2^1=1 ^Ik 



N ^ N 



Clk 



l=\ /=1 ^Ik 

N 

< l|C||i5^Q,[aE||V^JKx,)f + ^2] (48) 
where || • ||i denotes the maximum absolute column sum, and in the last step, we used ( fTTl ). Using ( |123| ), 
^wMxk) = V^JK^") + (^^ + - {xk - (49) 

From ( |124| ) and the norm inequality ||x + < 2||x|p + 2||?/|p, we obtain 

<2||V^JK^")f + 2ALx-||^fc-^1l' (50) 



Substituting ([50]) into (|48]), we obtain 

e|| J^Qfci;K^it)|| < ||C||iJ^Qfc[2aA^^,E||xfc-^^f + 2a||V^JK^")|^ 



^=1 

v2 ||r^||2 



<2aALxnii-E||xfc-«;°f + ||C||^-a, 



|2 ^2 



(51) 



where al = max {2a\\VwJi{w°)\\'^ + a^}. Substituting (|5T]) and x = TA,{wi-i) into ^ leads to 



l</<Af 



(a) 



I AT It; 



H|C||?aSl^} 
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- + 



^2 aT 



2o2 



(52) 



where step (a) is due to the fact that A{ is right- stochastic so that T^^ (1 at w"^) = 1 at ® tt;^, step (b) is 



because of the Unearity of Ta^{-), step (c) is due to property ( [3T] ), step (d) is a consequence of Property 
3 of Lemma [1} and step (e) is due to the convexity property ( |29l ). ■ 



Substituting d43j into (42), we obtain 



MSPi ^ A'^FdAj • MSPi_i + • A'^n'b, 



where 



(53) 



(54) 



The following theorem gives the stability conditions on the inequality recursion ( [53] ) and derives both 
asymptotic and non-asymptotic bounds for MSP. 

Theorem 2 (Mean-Square Stability and Bounds). Suppose A2TdAj is a stable matrix, i.e., p{A^V(iA^) < 
1. Then, the following non-asymptotic bound holds for all i > 0: 

MSPi ^ {A^TdA{y[MSPo - MSP^] + MSP^ 

where MSP^ is the asymptotic upper bound on MSP defined as 

MS^t = \\C\\1{In - A^raAjr^A^QH, 



(55) 



And, as i ^ oo, we have the following asymptotic bound 

limsupMSPi ^ MSP^ 
Furthermore, a sufficient condition that guarantees the stability of the matrix Ai^TdAii is that 

0<IJ.k< min • 



for all k — 1^ ... ^N, where cr^^^^^max ci^d cr k^rmn ^ere defined earlier in 



(56) 



(57) 



(58) 



Proof. Iterating inequality ( [53] ), we obtain 

MSPi < (^i^rX)^MSPo + • ^{AlVaAl 



i-l 



j=0 



A.2^ by 



(59) 
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For the second term in ([59]), we note that (/ + X H h - X) = / - X\ If X is a stable 



matrix so that (/ - X) is invertible, then it leads to J2]=o = " -^)~^' Using this relation 



as 



and given that the matrix A2TdAj is stable, we can express 

MSP, ^ {A^TdAjyuSFo + • [/^-(^i^r^Af )^] {lN-A^TdAj)-^A^nX 

= (A^TdAjy [MSPo - MSP^b] + MSP^b (60) 

Letting i ^ oo on both sides of the above inequality, we get limsupMSP^ ^ MSP^. In the last step, 
we need to show the conditions on the step-sizes {jj^k} that guarantee stability of the matrix A^TdAj. 
Note that the spectral radius of a matrix is upper bounded by its matrix norms. Therefore, 



< 11/4^11 

— 11^2 lloo 



lir.llc 



1^1 lloo 



lir.llc 



\c\\l-n' 



If the right-hand side of the above inequality is strictly less than one, then the matrix A^V^^ is stable. 



Using (|33])-(|34l), this condition is satisfied by the following quadratic inequalities on fij^ : 

(1 - Mife^it,max)^ + Mfc • 4c^A^axl|C^II? < 1 (61) 
(1 - l^k(^k,ininf + f4 ' ^(^^maJ\C\\l < 1 (62) 

for all A: = 1, . . . , A^. Solving the above inequalities, we obtain condition ( |58l ). ■ 
The non-asymptotic bound ([55]) characterizes how the MSP at each node evolves over time. It shows that 
the MSP converges to steady state at a geometric rate determined by the spectral radius of the matrix 
A2TdAi. The transient term is determined by the difference between the initial MSP and the steay-state 
MSP. At steady state, the MSP is upper bounded by MSP^. We now examine closely how small the 



steady-state MSP can be for small step- size parameters {jJ^k}- Taking the oc— norm of both sides of ( [57] ) 
and using the relation {In - A^FdAj)-^ = J^JLoi^I^d^JV . we obtain 



IMSP' 



Ubll 

cx) lloo 



\\c\\l-{ir,-A'^rdAj)-'.A^nX 



<\\c\\i-[J2\\^2\\L-\\mL-\\Aj\\i. 



I 4^11 . IIOII^ 
1^2 lloo ll^^llcx) 



(a) 

< \\c\\i 



( oo 

E 



max LLk 

l<k<N 
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' max jiik I (63) 



l-llr^lloo \l<k<N 

where step (a) is because and are right- stochastic matrices so that their oo— norms (maximum 
absolute row sum) are one. Let //max and //min denote the maximum and minimum values of {jJ^k}^ 
respectively, and let /3 = /Xmin/Mmax- For sufficiently small step-sizes, by the definitions of and F in 
and ([33]), we have 



llFdIloo < ||F||^ + 4aA^ax||C||M|f^| 



2 

oo 



l<k<N 

< 1 - min + Mmax (^max + 4q: A^ax 1 1 1 1 1 ) 

= l-2/3/Xmax^^min + Mmax(^max + 4Q:Vax||C'||i) (64) 

where amax and (Jmin are the maximum and minimum values of {cr/e,max} and {cr/e,min}. respectively, 
and step (a) holds for sufficiently small step-sizes. Note that ( [63] ) is a monotonically increasing function 
of ||F(i||oo. Substituting ( [64] ) into ( [63] ), we get 

iir:'i|2 . 11^ II . n 

(65) 



limsupllMSP.Iloo < llMSP^^Iloo < j^''^;''t^''^'/7^ Iiri|2, -0(A^^ 

i^oo ^P^min /^maxv^max~^'^^^niax||^ |llJ 



Note that, for sufficiently small step-sizes, the right-hand side of ([65]) is approximately ^^rfc^^^^Umax, 



which is on the order of 0(/Xmax)- In other words, the steady-state MSP can be made be arbitrarily small 
for small step-sizes, and the estimators Wi = co\{wi^i^ . . . ^w^^i} will be close to the fixed point Woo (in 
the mean-square sense) even under gradient perturbations. To understand how close the estimate Wk^i at 
each node k is to the Pareto-optimal solution w^, a natural question to consider is how close the fixed 
point Woo is to 1 AT ® w^, which we study next. 

D. Bias Analysis 

Our objective is to examine how large \\1n ® — it;oo|P is when the step-sizes are small. We carry 
out the analysis in two steps: first, we derive an expression for w^o = Iat ® it;^ — Woo, and then we 
derive the conditions that guarantee small bias. 

To begin with, recall that Woc is the fixed point of Td{-), to which the recursion Wi = Td{wi-i) 
converges. Also note that Td{-) is an operator representation of the recursions ([5])-([7]). We let i ^ oo on 
both sides of (|5|)-([7|) and obtain 

AT 

0/c,oo ^ Cil,lk m,oo (66) 
1=1 
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N 



N ^ 



(67) 
(68) 



where Wk^oo, (t^k.oo and ipk.oo denote the hmits of Wk^i, (j^k.i and ^pk.i as z ^ oc, respectively. Introduce 
the following bias vectors at node k 



3, ^/c,oo = ^""-^/c 



Subtracting each equation of ([66])-([68]) from and using relation VwJi{<^k,oo) = ^wJi{^^)—Hik, 
that can be derived from Lemma |4] in Appendix |A| we obtain 



N 



AT 



-l^k^cikHik 



/=1 



AT 



) 



N 



/=1 



where i?^it,cx) is a positive semi-definite symmetric matrix defined as 



lk,oo 



^ 1=1 



lk<JOLoo)]dt 



Introduce the following global vectors and matrices 



Woo - ® yo"" - Woo ^ CO\{wi^oo, • • • , WN.oo} 

Ai = Ai® Im, A2 = A2®Im, C = C® Im^ 
M = diag{/xi, . . . , iin] ® Im 

N 

T^oo — diag|Qig/i^oo, • • • , QAri?/Ar,oo|, 



1=1 



Then, expressions ([70]), ( |72| ) and ( [TT] ) lead to 



It;. 



(69) 

(70) 
(71) 
(72) 

(73) 

(74) 
(75) 
(76) 

(77) 
(78) 

(79) 



Theorem 3 (Bias at Small Step-sizes). Suppose that A^A[ is a regular right-stochastic matrix, so that 
its eigenvalue of largest magnitude is one with multiplicity one, and all other eigenvalues are strictly 
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smaller than one. Let 6^ denote the left eigenvector of of eigenvalue one. Furthermore, assume 

the following condition holds: 



(80) 



where = diag{/ii, . . . , /iA^} was defined earlier in Lemma [j] and cq is some constant. Then, 

O(mLx) (81) 



Proof: See Appendix |Bj 



Therefore, as long as the network is connected (not necessarily fully connected) and condition ([80]) holds, 
the bias would become arbitrarily small. For condition ( [80| ) to hold, one choice is to require the matrices 
and A^ to be doubly stochastic, and all nodes to use the same step-size /x, namely, Q = jj^I^. In that 
case, the matrix AJA2 is doubly-stochastic so that the left eigenvector of eigenvalue one is 9^ = 1^ 
and ([80]) holds. 

Finally, we combine the results from Theorems [2] and [3] to bound the mean-square-error (MSE) of the 
estimators {wk^i} from the desired Pareto-optimal solution w^. Introduce the x 1 MSE vector 

MSE^ = EP[wi] 

= EP[1n®w'' -Wi] 
= co\I^E\\wi^if,...,E\\wN,if} 
Using Properties 3-4 in Lemma [1} we obtain 

MSE,: = EP [2 f + - 



(83) 



(82) 



2 2 

^ 2P[w^]+2EP[woo-Wi] 



Taking the oo— norm of both sides of above inequality and using property ( [361 ), we obtain 

limsup||MSEi||oo < 2||P[u}oo]||oo + 21imsup ||MSPi||oo 



2\\w, 



+ 21imsup ||MSPi 



(84) 



where in the last step, we used ( |65l ) and ( [81] ), and the fact that all vector norms are equivalent. Therefore, 
as the step-sizes become small, the MSEs become small and the estimates {wk^i} get arbitrarily close 
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to the Pareto-optimal solution w^. We also observe that, for small step-sizes, the dominating steady-state 
error is MSP, which is caused by the gradient noise and is on the order of 0(/imax)- On the other hand, 
the bias term is a high order component, i.e., O(i^max)' be ignored. 



The fact that the bias term Woo is small also gives us a useful approximation for T^oo in ( [77] ). Since 
Woo = col{it;i^oo5 • • • 5 *A^,cx)} is small for small step-sizes, the matrix Hik^oo defined in ( [73] ) can be 
approximated as i?//e,cx) ^ ^^Ji{^^)' Then, by definition ( [771 ), we have 



N 








1=1 





(85) 



Expressing (85) is useful for evaluating closed-form expressions of the steady-state MSE in sequel. 



E. Steady-State Performance 

So far, we derived inequalities ([84]) to bound the steady-state performance, and showed that, for small 
step-sizes, the solution at each node k approaches the same Pareto-optimal point w^. In this section, 
we derive closed-form expressions (rather than bounds) for the steady-state MSE at small step-sizes. 
Introduce the error vector^ 



and the following global random quantities 

N 

=y^diag|QiiJ/i^^_i, • • • ,QAri?/Ar,i-i| 
1=1 

fji =^COl|Qii;/((/)i,i_i), • • • ,CiNVi{(j)N,i-l)^ 

1=1 

Then, extending the derivation from |21, Sec. IV A], we can establish that 

Wi = Al[lMN-Mni-i]Alwi-i-\-AlMC^g''-^AlMgi 



(86) 

(87) 
(88) 

(89) 

(90) 

(91) 



According to ( [84] ), the error Wk^i at each node k would be small for small step-sizes and after long enough 
time. In other words, Wk^i is close to w^. And recalling from ([12]) that (pk,i-i is a convex combination 

^ In this paper, we always use the notation w = — w to denote the error relative to w^. For the error between w and the 
fixed point Woo, we do not define a separate notation, but instead write Woo — w explicitly to avoid confusion. 
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of {wi^i}, we conclude that the quantities {(f)i^i-i} are also close to w°. Therefore, we can approximate 
Hik,i-i, Ui-i and gi in (|88]l-(|90ll by 

-i~ CylJi{w°)dt^VlJi{w°) 
Jo 

N 



Hi 



(92) 
(93) 



Then, the error recursion ( |9T] ) can be approximated by 



Wi = A2[lMN-Mnoo]A\wi-i+A2MC^g"+A2Mg, 



(94) 



First, let us examine the behavior of Etij. Taking expectation of both sides of recursion ( [94| ), we obtain 

Etij = AI[Imn - Mnoo]Aj¥.Wi-i + AlMC^g° (95) 

This recursion converges when the matrix ^2^[lMiv — A^T^ooJ-^T is stable, which is guaranteed by 
(see Appendix C of |21|). Let i — > oo on both sides of ( |95] l so that 



(96) 



^Woo = lim ¥.Wi 






Imn—^ {Imn—MTZoo) Ai 





Note that Eii^oo coincides with (|79]). By Theorem [3] we know that the squared norm of this expression 
is on the order of 0(/^max) small step-sizes — see ( [ST] ). Next, we derive closed-form expressions for 
the MSEs, i.e., E||ii?/e^i|p. Let Ry denote the covariance matrix of gi evaluated at w^: 



N 



N 



(97) 



^ col I Qi • ^CiNVi^i{w'')'^ ^col^^ciiVi^i{w''), - - - , Q^VU/^i ('u;'') | 

.1=1 J L/=i J J 

In practice, we can evaluate Ry from the expressions of {vi^i{w^)}. Equating the squared weighted 
Euclidean "norm" of both sides of ([94]), applying the expectation operator with assumption ( [T6| ), and 
following the same line of reasoning from [21], we can establish the following approximate variance 
relation at small step-sizes: 

E\\w,\\l ^ E||ii;,_i|||, + TT{J:A^MRyMA2) + TT{^A^MC^g%A^MC^g''f} 

+ 2{A^MC^gY^A^ {Imn-MTZoo) A[¥.w^^i (98) 



^' ^ Ai {Imn-MUoo) A2^Al {Imn-MUoo) AI 



(99) 



where H is a positive semi-definite weighting matrix that we are free to choose. Let a = vec(Il) denote 
the vectorization operation that stacks the columns of a matrix H on top of each other. We shall use the 
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notation and interchangeably. Following the argument from 121), we can rewrite (|98l) as 



Ellm 



E||iyi_i||^^ + r^cr + a' Q Ewi-i 



* no- 



where 



F = Al[lMN-Mnoo\A2®Al[lMN-Mnoo\A2 

r ^ wec{AlMKMA2)+AlMC^g°(^A^MC'^g° 
Q = 2AI{Imn-MTZoo)AJ A^MC^g" 



(100) 

(101) 
(102) 
(103) 



We already established that Ewi-i on the right-hand side of ( |100| ) converges to its limit Ewoo under 
condition ([38]). And, it was shown in (31] pp.344-346] that such recursion converges to a steady-state 
value if the matrix F is stable, i.e., p{F) < 1. This condition is guaranteed when the step-sizes are 
sufficiently small (or chosen according to (|38])) — see the proof in Appendix C of (21). Letting i — ^ oo 
on both sides of expression ( |100| l, we obtain: 



lim E\\wi\\fj_p.^ ^{r + Q Ewoo) a 



(104) 



We can now resort to ( |104| ) and use it to evaluate various performance metrics by choosing proper 
weighting matrices H (or a). For example, the MSB of any node k can be obtained by computing 
lim^^oo with a block weighting matrix T that has an identity matrix at block (k^k) and 

zeros elsewhere: lim E||tt;/e,i|P — 1™ Denote the vectorized version of this matrix by = 

vec(diag(e/e) ®Im), where Ck is a vector whose fcth entry is one and zeros elsewhere. Then, if we select 
a in ( |104| ) as a = (/ — F)~^tk, the term on the left-hand side becomes the desired lim^^oo 
and the MSE for node k is therefore given by: 



MSEk = lim E||ii;/e,if ^ {r + Q Ew^^)^ {I-PyHk 



If we are interested in the average network MSE, then it is given by 



1 ^ 

MSE 4 MSE, 



(105) 



(106) 



k=i 



IV. Application to Collaborative Decision Making 

We illustrate one application of the framework developed in the previous sections to the problem of 
collaborative decision making over a network of N agents. We consider an application in finance where 
each entry of the decision vector w denotes the amount of investment in a specific type of asset. Let 
the M X 1 vector p represent the return in investment. Each entry of p represents the return for a 
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unit investment in the corresponding asset. Let p and Rp denote the mean and covariance matrix of p, 
respectively. Then, the overall return by the agents for a decision vector w is p^w. Note that, with decision 
w, the return p^w is a (scalar) random variable with mean p^w and variance var(p^it;) = w^RpW, which 
are called the expected return and variance of the return in classical Markowitz portfolio optimization |[2j 
p. 155], [(32|-[[35|. These two metrics are often used to characterize the quality of the decision w: we want 
to maximize the expected return while minimizing the variance. However, solving the problem directly 
requires all agents to know the global statistics p and Rp. What is available in practice are observations 
that are collected at the various nodes. Suppose a subset U of the agents observes a sequence of return 
vectors {uk^i} with ¥^Uk^i = p. The subscripts k and i denote that the return is observed by node k at 
time i. Then, we can formulate the cost functions for the nodes in set U as follows: 

Ju,kH = -n^h^] = -p^^ (107) 

keU C{1,...,N} 



We place a negative sign in ( |107| ) so that minimizing Ju,k{^) is equivalent to maximizing the expected 
return. Similarly, suppose there is another subset of nodes, exclusive from U and denoted by 5, which 
observes a sequence of centered return vectors {sk^i}, namely, vectors that have the same distribution as 
p — Ep so that K[sk^iS^-] = Rp. Then, we can associate with these nodes the cost functions: 

J^^j^{w) =E \sl^-w\^ ^w^RpW (108) 

A;g5c{1,...,7V} 

Additionally, apart from selecting the decision vector w to maximize the return subject to minimizing 
its variance, the investment strategy w needs to satisfy other constraints such as: i) the total amount of 
investment should be less than a maximum value that is known only to an agent /cq E /C (e.g., agent ko 
is from the funding department who knows how much funding is available), ii) the investment on each 
asset be nonnegative (known to all agents), and iii) tax requirements and tax deduction^ known to agents 
in a set We can then formulate the following constrained multi-objective optimization problem: 

I keU keS J 

^For example, suppose the first and second entries of the decision vector w denote the investments on charity assets. When the 
charity investments exceed a certain amount, say 6, there would be a tax deduction. We can represent this situation by writing 
h^w > b, where h = [1 1 • • • 0]^. 
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s.t. l^w < bo 

h^w > bk, k 



(110) 
(111) 
(112) 



Using the scalarization technique and barrier function method from Sec. |l| we convert ( |109| )-( [TT2l ) into 
the following unconstrained optimization problem (for simplicity, we only consider 7ri=-=7rAr = l): 



W) 



M 



kes 

+ E 

ken 

+ E 

keJC 

where (/)(•) is a barrier function to penalize the violation of the constraints — see |3 1 for an example, and 
the vector G is a basis vector whose entries are all zero except for a value of one at the mth entry. 



m=l 






M 












m=l 








M 




hlw) + 






m=l 






M 




-M + 


^ (t){-e^w) 









The term J2 



M 

m=l ' 



-e^w) is added to each cost function to enforce the nonnegativity constraint ( |112| ), 



which is assumed to be known to all agents. Note that there is a "division of labor" over the network: 
the entire set of nodes is divided into four mutually exclusive subsets {1, . . . , N} = UUSUT-LUlC, 
and each subset collects one type of information related to the decision. Diffusion adaptation strategies 
allow the nodes to arrive at a Pareto-optimal decision in a distributed manner over the network, and each 
subset of nodes influences the overall investment strategy. 

In our simulation, we consider a randomly generated connected network topology. There are a total 
of = 10 nodes in the network, and nodes are assumed connected when they are close enough 
geographically. The cardinalities of the subsets U, S, % and K are set to be 3, 4, 2 and 1, respectively. The 
nodes are partitioned into these four subsets randomly. The dimension of the decision vector is M = 5. 
The random vectors Uk^i and Sk^i are generated according to the Gaussian distributions A/'(1,/m) and 
A/'(0, /m), respectively. We set 6o = 5 and the parameters {hk^ b^} for k ^% to 



hk, = [1 2 
hk. = [5 4 



5] , bk,^2 
1] , ^fc. = 3 



(113) 
(114) 
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(a) Topology of the network. 




(c) MSE for different values of step- sizes. 
Fig. 4. Simulation results for collaborative decision making 
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(b) Learning curve (/i = 10 ^). 
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(d) Error of fixed point for different values of step- 
sizes. 



where ki and k2 are the indices of the two nodes in the subset H. Furthermore, we use the barrier 
function given by (15) in [3| in our simulation with t = 10, p = 0.1 and r = 0.1. We set the combination 
coefficients {aik} to the Metropolis rule (See Table III in j23| ) for both ATC and CTA strategies. The 
weights {cik} are set to cik = 1 for / = A: and zero otherwise, i.e., there is no exchange of gradient 
information among neighbors. According to Theorem [3} such a choice will always guarantee condition 



( |80| ) so that the bias can be made arbitrarily small for small step-sizes. In our simulation, we do not 
assume the statistics of {uk^i} and {s/c,i} are known to the nodes. The only information available is their 
realizations and the algorithms have to learn the best decision vector w from them. Therefore, we use 



August 14, 2012 



DRAFT 



27 



the following stochastic gradient vectoij^ at each node k: 

-Uk,i + Em=l ^vo<i){- 



V^cj^il^w - bo) + E 



m=i '^yj'- 



-eLw) 



keS 
ken 
keJC 



(115) 



To compare the performance with other algorithms, we also simulate the consensus-based approach from 
[ 16 | with the same stochastic gradienj^as ( |115| ). The algorithm is listed below: 



'Wk,i = ^ aikWi^i-i - iiVwJkiwk^i-i) 
leJVk 



(116) 



Furthermore, we also simulate the conventional centralized approach to such optimization problem, which 
collects data from all nodes and implements stochstic gradient descent at the central node: 



1 ^ . . 

k=l 



(117) 



where the factor of is used to make the convergence rate the same as the distributed algorithms. 



The simulatin results are shown in Fig. |4(a)f[4(d)] Fig. |4(a)| shows the network topology, and Fig. 
4(b)| shows the learning curves of different algorithms. We see that ATC outperforms CTA and CTA 
outperforms consensus. To further compare the steady-state performance, we plot the steady-state MSB 
for different values of step-sizes in Fig. |4(c)[ We also plot the theoretical curves from ( |105| )-( [T06| ) for 
ATC and CTA algorithms. We observe that all algorithms approach the performance of the centralized 
solution when the step-sizes are small. However, diffusion algorithms always outperform the consensus- 
based strategy; the gap between ATC and consensus algorithm is about 8 dB when = 0.1. We also see 
that the theoretical curves match the simulated ones well. Finally, we recall that Theorem |3] shows that 
the error between the fixed point Woo and l0w^ can be made arbitrarily small for small step-sizes, and 
the error \\woo — 1 ® is on the order of 0{iJ?). To illustrate the result, we simulate the algorithms 
using true gradients {S/ujJki'^)} ^hat they converge to their fixed point Woo, and we get different 
values of Woo for different step-sizes. The theoretical values for ATC and CTA can be computed from 



( |79| ). The results are shown in Fig. |4(d)[ We see that the theory matches simulation, and the power of 

^For nodes in H an /C, the cost functions are known precisely, so their true gradients are used. 
^The original algorithm in \\^\ does not use stochastic gradients but the true gradients {V wJk(w)}. 
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the fixed point error per nodq^ decays at 20dB per decade, whicii is 0{iJ?) and is consistent with ( [ST] ). 
Note that diffusion algorithms outperform the consensus. Also note from ( [791 ) and ([96]) that the bias and 



the fixed point error have the same expression. Therefore, diffusion algorithms have smaller bias than 



consensus (the gap in Fig. 4(d) is as large as 5dB between ATC and consensus). 



V. Conclusion 

This paper generalized diffusion adaptation strategies to perform multi-objective optimization in a 
distributed manner over a network of nodes. We use constant step-sizes to endow the network with 
continuous learning and adaptation abilities via local interactions. We analyzed the mean- square-error 
performance of the diffusion strategy, and showed that the solution at each node gets arbitrarily close to 
the same Pareto-optimal solution for small step-sizes. 

Appendix A 
Properties of the Operators 

Properties 1-3 are straightforward from the definitions of T^( ) and P[ ]. We therefore omit the proof 
for brevity, and start with property 4. 



(Property 4: Convexity) 

We can express each x 1 block vector x^^^ in the form x^^^ — co\{x^^\ . . . , x^^} for = 1, . . . , A^. 
Then, the convex combination of x^^\ . . . , x^^^ can be expressed as 



k=l I k=l k=l J 



P 



E 

k=l 



ai X 



(k) 



[ k=l k=l J 

r K K \ 



k=l 



k=l 



K 



X 



k=i 



(118) 



According to the definition of the operator P[], and in view of the convexity of || • |p, we have 

K 



(119) 



"^The power of the fixed point error per node is defined as -^\\woo — 1 w^W'^ = X^fcLi ll^fc,oo — '^'^H^- 
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(Property 5: Additivity) 

By the definition of P[-] and the assumption that Kx^yk = for each A: = 1, . . . , A^, we obtain 

EP[x + 2/] =col{E||xi+yif , E\\xN + yNf} 

= col{E||xif +E||2/if , E||x7vf + E||2/7vf } 
= EP[x]+EP[y] (120) 
(Property 6: Variance Relations) 



We first prove ( [3T] ). From the definition of Ta{-) in ( fTS) ) and the definition of P[ ] in ( |20l ), we express 



P[Ta{x)] = coU 5^ an 



AT 2 ^ 

^/ ' • • • ' ^^^^^ 
1=1 1=1 



(121) 



Since || • |p is a convex function and each sum inside the squared norm operator is a convex combination 
of xi, . . . , xat (A^ is right stochastic), by Jensen's inequality |2j p. 77], we have 

{N N ^ 

^aii\\xif, . . . , ^aiN\\xif > 



= A^col{||xif , 



(122) 



Next, we proceed to prove ([32]). We need to call upon the following useful lemmas from (4} p. 24], and 
Lemmas 1-2 in | |21J , respectively. 



Lemma 4 (Mean- Value Theorem). For any twice-dijferentiable function /(•), it holds that 



^fiy) = v/(x) + 



^ V^f[x + tiy-x) 



dt 



(y-x) 



(123) 



where V^/(-) denotes the Hessian of f{-), and is a symmetric matrix. 



Lemma 5 (Bounds on the Integral of Hessian). Under Assumption\l\ the following bounds hold for any 
vectors x and y: 



A/,min/M < / ^IMX + ty)dt < Xi 

Jo 

I - fJ,ky2'^lk / Vl,Ji{x + ty)dt 

1=1 Lio 



(124) 
(125) 



where \\ • || denotes the 2— induced norm, and o-k^min o-k.ma^ were defined in ([34|)-(|35|). 



August 14, 2012 



DRAFT 



30 



By the definition of tiie operator Tg{-) in ([19]) and the expression ( |123| l, we express Tg{x) — Tciy) as 



Tcix) - Tciy) 



N 1 

1=1 



(xi-yi) 



N 1 

^M-A^Af V^Qiv/ '^l,Jl{yN + t{xN-yN))dt 
1=1 



(xN-yN) 



(126) 



Therefore, using ( |125| l and the definition of P[ ] in ( |20l ), we obtain 

P[Tg{x) - Tciy)] ^ col {7? . ||xi - yif, . . . ,7^ • H^at - yjvf } 

= r2p[x - 1/] 

(Property 7: Block Maximum Norm) 

According to the definition of P[ ] and the definition of block maximum norm pT|, we have 



(127) 



\P[x] 



COl{||xi|p, . . . , IIXA^IP} 

max ||xz.|P 

l<k<N 



max \\Xh\ 

l<k<N 



ll^l|2 



(128) 



(Property 8: Preservation of Inequality) 



To prove Fx ^ Fy, it suffices to prove ^ F{y — x). Since x ^ y, we have ^ y — x, i.e., all entries 
of the vector y — x are nonnegative. Furthermore, since all entries of the matrix F are nonnegative, the 
entries of the vector F{y — x) are all nonnegative, which means ^ F{y — x). 



It suffices to show that 



Appendix B 
Bias at Small Step-Sizes 



^. \\l(^w''-Woo\ 

nm 



(129) 



where ^ is a constant independent of //max- It is known that any matrix is similar to a Jordan canonical 
form [ [36| . Hence, there exists an invertible matrix Y such that A^A^ — YJY~^, where J is the Jordan 
canonical form of the matrix A^Aj, and the columns of the matrix Y are the corresponding right 
principal vectors of various degrees [36, pp. 82-88]; the right principal vector of degree one is the right 
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eigenvector. Obviously, the matrices J and Y are independent of /Xmax- Using the Kronecker product 



property |36 p. 140]: {A ® B){C ® D) ^ AC ® BD, we obtain 



^\ — A2 A 



T aT , 



2 ^1 



{Y®Im){J®Im){Y-^®Im) 



(130) 



Denote jj^k = where /3k is some positive scalar such that < /3/e < 1. Substituting (130) into 



( |79l ), we obtain 

1 ® — Woo 

where 



{Y Im) [Imn - J ® Im + Mmax^^]"' {Y'' ® /m)^2 A^CV (131) 



E = (y-i lM)A^MonooAj{Y /m) 
A^o = A^/^max = diag{/3i, . . . , /^at} 0/m 

^ V ' 

= ^0 



(132) 
(133) 



By ([8]), the matrix A^A[ is right- stochastic, and since A2AJ is regular, it will have an eigenvalue of 
one that has multiplicity one and is strictly greater than all other eigenvalues fSl]. Furthermore, the 
corresponding left and right eigenvectors are 9^ and 1, with 9^ y (all entries of the row vector 9^ are 
real positive numbers). For this reason, we can partition J, Y~^ and Y in the following block forms: 



J = diag{l, Jo}, Y = col 



Yr , Y = [1Yl] 



(134) 



where Jo is an (A'' — 1) x (N — l) matrix that contains the Jordan blocks of eigenvalues strictly within unit 
circle, i.e., p( Jo) < 1. The first row of the matrix Y~^ in ( |134| ) is normalized by 6^1 so that Y~^Y — I. 
(Note that Y~^Y = I requires the product of the first row of and the first column of Y to be one: 



^Tjt — 1.) Substituting these partitionings into ( |132| l, we can express E as 

En Ei2 



E 



E21 E22 



(135) 



where 



E 



11 



E 



12 



ImU2-Mo^oo^1 (10 Jm) 



Im U2 MoTZc^Ai {Yl Im) 



E21 = (Yr lM)A^MonooAj{l ® /. 



M 



(136) 

(137) 
(138) 
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E22 = {Yr lM)A^MonooAj(YL Im) 



(139) 



Observe that the matrices En, Eu, E21 and £^22 are independent of //max- Substituting ( |134| ) and ( |135| ) 



into ( |131| ), we obtain 
Let us denote 



Mmax^'ll Mmax£'l2 
Mmax£^21 / — Jo ® /M + Mmax£^22 



-1 r 



(140) 



Gil 


G12 




G21 


G22 





(141) 



(142) 



Mmax^ll Mmax^l2 
Mmax£'21 I- Jo® /M + Mmax£^22 

Furthermore, recalling that it;^ is the minimizer of the global cost function Q, we have 

AT 

5^v^JK^") = o ^ (1^0/^)^ = 

which, together with condition ( [80| ), implies that 

(0^ /m)^2 A^cV = {e^A^nc^ /m)^^ 

= co(1^0/m)^^ 
= 

where we also used the facts that A2 = A2 ® Im, = ® Im, M = ^ ® Im and the Kronecker 
product property: {A ® B){C ® D). Substituting ( |141D and ( |143D into ( |140D and using ( |133D lead to 

G12 



(143) 



G 



22 



(144) 



1 ® W"" - Woo = Mmax ' ® /m) 

To proceed with analysis, we need to evaluate G12 and ^22- We call upon the relation from |[36j pp.48]: 

(145) 



P Q 


-1 


p_i ^ p-iQ5f/p-i -p-^QS 


U V 




-sup-^ S 



where 5 = (y — UP ^Q) ^. To apply the above relation to ( |141| ), we first need to verify that £^11 is 
invertible. By definition ( |136| ), 



Ell = [-^Ai fto /Mj7^oo(^l 1 

= (Z^0/M)7^oo(l«)/ 
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N N 



(146) 



where denotes the fcth entry of the vector z = QoA20/9^1 (note that all entries of z are non-negative, 



i.e., Zk > 0). Recall from ( [73] ) that Hik^oo is a symmetric positive semi-definite matrix. Moreover, since 
Zk and cik are nonnegative, we can conclude from ( |146| ) that is a symmetric positive semi-definite 
matrix. Next, we show that En is actually strictly positive definite. Applying ( |124| ) to the expression of 
Hik^oo in ([73]), we obtain i7//e,oo > Mm^^Im- Substituting into ( |146[ ) gives: 

AT AT 



^11 > 



M 




(147) 

Noting that the matrices fio and have nonnegative entries with some entries being positive, and that 
all entries of 9 are positive, we have (l^fio^2^)/(^^l) > 0. Furthermore, by Assumption [l] we know 
YldLi^ik^M-^ > for each k = 1,...,A^. Therefore, we conclude that En > and is invertible. 
Applying ( |145| ) to ( |141[ ), we get 



Gi2 = — £^]^]^''^£'i2G22 

G22 = [^ — -^0 ® ^M + Mmax(^22 — £'21^11''" £^12)] 

Substituting ( [149] ) into ( [144] ) leads to 



(148) 
(149) 



1 



M) 



"^11 ^12 
/ 



(150) 



Substituting expression ( |150[ ) into the left-hand side of ( |129[ ), we get 

11 tt;^ — It; 00 1 



lim 

Mmax^O 



Me 



= lim 

/imax^O 











/ 



(151) 



Observe that the only term on the right-hand side of ( [151] ) that depends on //max is ^22- From its 
expression ( [149] ), we observe that, as //max 0, the matrix G22 tends to {I — ® Im)~^, which is 
independent of /Xmax- Therefore, the limit on the right-hand side of ( |151[ ) is independent of //max- 
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